ABSTRACT

Users of Amazon's online shopping service are allowed to leave feedback for
the items they buy. Amazon makes no effort to monitor or limit the scope of
these reviews. Although the number of reviews varies from item to item, the
reviews provide easily accessible and abundant data for a variety of
applications. This paper aims to apply and expand existing natural language
processing and sentiment analysis research to data obtained from Amazon.
The number of stars given to a product by a user is used as training data for
supervised machine learning. Since more people depend on online products
these days, the value of a review is increasing. Before making a purchase, a
buyer must read thousands of reviews to fully comprehend a product. In this
day and age of machine learning, however, sorting through thousands of
comments and learning from them would be much easier if a model were used
to polarize and learn from them. We used supervised learning to polarize a
massive Amazon dataset and achieve satisfactory accuracy.

Published in International Journal of Trend in Scientific Research and
Development (ijtsrd), ISSN: 2456-6470, Volume-5 | Issue-4, June 2021,
pp.720-723, IJTSRD42372.
INTRODUCTION

As online marketplaces have grown in popularity over the 
years, online retailers and vendors have encouraged their 
customers to share their thoughts on the items they've 
purchased. Thousands of reviews are written every day on 
the Internet about a wide range of products, programmes, 
and locations. As a result, the Internet has surpassed all 
other sources for collecting information and opinions on a 
product or service. 


The Internet has revolutionized the way we purchase
products. In the retail e-commerce environment of an online
marketplace, however, testing a product before buying it is
not feasible. Furthermore, in today's retail environment, a
large number of new products are introduced on a regular
basis. As a result, consumers rely heavily on product
feedback to shape their opinions in preparation for the more
complex decision-making involved in a purchase. Users, on
the other hand, often find searching through and comparing
text reviews to be challenging. We therefore want a better
numerical rating system that is backed up by the review
text, so that consumers can easily make a buying decision.


Customers may need a rating mechanism at some point during
their decision-making process in order to locate useful
feedback as quickly as possible. As a result, models that
can predict a user's rating from the text of a review are
critical. Capturing the overall sentiment of a textual
review can help improve customer service. It can also help
businesses increase sales and develop their products by
gaining a better understanding of what their customers
want.


The Amazon electronic product review dataset was used in
this work. Both the ratings that customers gave to the
different products and the text of their reviews of those
products were taken into account.
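
The abstract notes that the star rating serves as the training label. A minimal sketch of one way to turn star ratings into sentiment classes is given here. The thresholds (below 3 stars negative, 3 stars neutral, 4 or more positive) and the function name `rating_to_sentiment` are assumptions for illustration, since the paper does not state the exact mapping.

```python
def rating_to_sentiment(stars):
    """Map a 1-5 star rating to a sentiment label (assumed thresholds)."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    if stars >= 4:
        return "Positive"
    if stars >= 3:
        return "Neutral"
    return "Negative"


# Hypothetical (review text, stars) pairs, not taken from the dataset.
reviews = [("Great sound quality", 5), ("It is okay", 3), ("Stopped working", 1)]
labels = [rating_to_sentiment(stars) for _, stars in reviews]
print(labels)
```

Labels produced this way can be paired with the review text to train any of the supervised classifiers discussed later.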


LITERATURE SURVEY 

Sentiment analysis has received a lot of attention in recent
years thanks to the abundance of online reviews. As a result,
numerous studies have been conducted in this area. Some of
the research works most relevant to this paper are discussed
in this section.


SVM was tested for text classification by Joachims (1998), 
who found that it performed well in all experiments with 
lower error levels than other classification methods. 


Using SVM, Naive Bayes, and maximum entropy classification,
Pang, Lee, and Vaithyanathan (2002) applied supervised
learning to classify movie reviews into two groups, positive
and negative. In terms of accuracy, all three methods
performed well. They experimented with different features
and found that the machine learning algorithms performed
better when a bag of words was used as the feature set in
the classifiers.


Three supervised machine learning algorithms, Naive Bayes,
SVM, and an N-gram model, were tested on online feedback
about various travel destinations around the world in a
study conducted by Ye et al. (2009). They found that
well-trained machine learning algorithms classify travel
destination reviews with very high accuracy. They also
showed that the SVM and N-gram models outperformed the Naive
Bayes model. However, increasing the amount of training data
narrowed the gap between the algorithms significantly.


Chaovalit and Zhou (2005) compared a supervised machine
learning algorithm with an unsupervised approach called
semantic orientation on movie reviews, and found that the
supervised approach was more accurate than the unsupervised
one.


Naive Bayes and SVM are two of the most widely used
methods in sentiment classification problems, according to
several studies (Joachims 1998; Pang et al. 2002; Ye et al.
2009). As a result, this study attempts to apply supervised
machine learning algorithms such as Naive Bayes and SVM to
Amazon product reviews.


PROPOSED SYSTEM 

The method entails gathering product-based datasets from
various e-commerce sites such as amazon.com, epinion.com,
and others. The reviews cover items such as phones, iPods,
and other electronic devices. The aim of this project is to
use algorithms like random forest, decision tree, and SVM to
evaluate and predict the sentiment of product reviews by
classifying them as positive, negative, or neutral. Since
the input consists of unstructured product reviews, we
conduct pre-processing, extract the features on which
comments are made, measure the polarity of the feedback, and
plot a graph of the result. Handling negation is also
covered. For instance, "the Nokia phone is not bad" is a
positive review despite the negative word "not". The flow of
the approach is shown in the diagram below, and its
components are explained in detail in the following
subsections.

[Figure: flow of the approach. A sentence from a review is passed to the
classifier model, which labels it as a positive review or a negative review.]
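
The negation example above ("not bad") can be handled in several ways, and the paper does not say which one was used. A minimal sketch of one common pre-processing trick is given here: every token that follows a negation word is prefixed with NOT_ until the next punctuation mark. The function name, the negation word list, and the NOT_ prefix are illustrative assumptions.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "cannot"}  # assumed, deliberately small
PUNCTUATION = set(".,!?;")


def mark_negation(text):
    """Prefix tokens after a negation word with NOT_ until punctuation."""
    tokens = re.findall(r"[a-z']+|[.,!?;]", text.lower())
    marked, negating = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negating = False          # punctuation ends the negation scope
            marked.append(token)
        elif token in NEGATION_WORDS or token.endswith("n't"):
            negating = True           # start negating the tokens that follow
            marked.append(token)
        else:
            marked.append("NOT_" + token if negating else token)
    return marked


print(mark_negation("The Nokia phone is not bad, I like it."))
```

With this marking, "NOT_bad" becomes a feature distinct from "bad", so a bag-of-words classifier has a chance to learn that it signals a positive review.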


Sentiment Classification Algorithm: 

Sentiment analysis, also known as opinion mining, is a
problem in natural language processing (NLP) that entails
recognizing and extracting subjective information from text
sources. The aim of sentiment classification is to interpret
user feedback and categorize it as positive or negative,
without requiring the system to fully comprehend the
semantics of each phrase or text.


Sentiment analysis is becoming a powerful method for
monitoring and analyzing consumer sentiment as people
share their thoughts and feelings more freely than ever
before. Brands can learn what makes consumers happy or
unhappy by automatically analyzing consumer feedback such as
survey responses and social media interactions. This allows
them to tailor goods and services to their customers' specific
requirements.


Sentiment classification has been applied to different
areas, such as movie reviews, travel destination reviews,
and product reviews.


Random forest Classifier (RFC) 
Random Forest is an ensemble method that is built by
combining multiple decision trees. Single-tree classifiers,
such as decision tree classifiers, can run into issues with
outliers or noisy data, which can affect the performance of
the classifier. Random Forest, by contrast, introduces
randomness and is therefore highly resistant to noise and
outliers. This classifier uses two different forms of
randomness: data randomness and feature randomness. Because
it combines multiple decision trees, this classifier has a
number of hyperparameters, such as:
> How many trees should be built in the forest?
> What is the maximum number of features that can be
selected at random?
> What is the maximum height of each tree?


Since it uses the concepts of bootstrapping and bagging,
Random Forest is considered to be a reliable and accurate
classifier.
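
The three hyperparameters listed above correspond to arguments of scikit-learn's `RandomForestClassifier`. The sketch below shows how they could be set. The synthetic data and the specific values (50 trees, depth 5) are assumptions for illustration and not the settings used in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the paper's review features are not reproduced here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=50,      # how many trees to build
    max_features="sqrt",  # features considered at random for each split
    max_depth=5,          # maximum height of each tree
    bootstrap=True,       # each tree is trained on a bootstrap sample of the rows
    random_state=0,
)
forest.fit(X, y)

tree_depths = [tree.get_depth() for tree in forest.estimators_]
print(len(forest.estimators_), max(tree_depths))
```

Here `bootstrap=True` supplies the data randomness and `max_features` supplies the feature randomness described above.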


Support vector machine (SVM) 

Support vector machines (SVMs) are a type of supervised
learning method that can be used to solve sentiment
classification problems (Cristianini & Shawe-Taylor 2000).
This approach places labeled training data in a feature
space, then uses an algorithm to find an optimal hyperplane
that divides the data into groups or classes. As shown in
Figure 1, the best hyperplane is the one that separates the
groups by the largest margin. This is done by choosing the
hyperplane that is furthest from the nearest data point of
each class (Berk 2016). “The groups are not separated in H1.
H2 has a slight advantage, but only by a small margin. H3
divides them by the greatest possible margin.” (Weinberg
2012).





Fig 1: Support Vector Machine
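
To make the margin idea concrete, the sketch below computes the geometric margin of three candidate lines on a toy 2-D dataset. The points and the three lines (named H1, H2, and H3 after the figure) are made up for illustration and are not the ones drawn in the figure. A real SVM solver searches for the maximum-margin hyperplane directly rather than comparing a fixed list of candidates.

```python
import math


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the line w.x + b = 0.

    A negative value means at least one point lies on the wrong side.
    """
    norm = math.hypot(w[0], w[1])
    return min(y * (w[0] * x1 + w[1] * x2 + b) / norm
               for (x1, x2), y in zip(points, labels))


# Toy data: class +1 towards the right, class -1 towards the left.
points = [(3, 3), (4, 2), (1, 1), (0, 2)]
labels = [+1, +1, -1, -1]

candidates = {
    "H1": ((0.0, 1.0), -2.5),  # x2 = 2.5, does not separate the classes
    "H2": ((1.0, 0.0), -1.5),  # x1 = 1.5, separates with a small margin
    "H3": ((1.0, 1.0), -4.0),  # x1 + x2 = 4, separates with a larger margin
}
margins = {name: geometric_margin(w, b, points, labels)
           for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```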


Logistic Regression Classifier (LRC) 

Logistic regression predicts the likelihood of an outcome
that has only two possible values (i.e. a dichotomy). The
prediction is made from one or more predictors, which may be
numerical or categorical. Linear regression is unsuitable
for predicting the value of a binary variable for two
reasons:
> A linear regression would predict values outside the
appropriate range (e.g. probabilities outside the range 0
to 1).
> The residuals would not be normally distributed around
the predicted line, since each observation of a dichotomous
variable can take only one of two values.


A logistic regression, on the other hand, yields a logistic
curve with values ranging from 0 to 1. In logistic
regression, the curve is constructed using the natural
logarithm of the "odds" of the target variable rather than
the probability itself. Furthermore, the predictors do not
have to be normally distributed or have equal variance in
each group.
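
The relationship between the linear score, the logistic curve, and the log-odds can be sketched in a few lines. The coefficients below are hypothetical values chosen for illustration and are not fitted to the review data.

```python
import math


def sigmoid(z):
    """Logistic curve: maps any real-valued score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_odds(p):
    """Natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients for a single predictor x (not fitted to real data).
intercept, slope = -1.0, 0.8

for x in (-5.0, 0.0, 5.0):
    score = intercept + slope * x   # linear part, unbounded
    prob = sigmoid(score)           # always between 0 and 1
    print(x, round(score, 2), round(prob, 3), round(log_odds(prob), 2))
```

The linear score can take any real value, while the predicted probability stays between 0 and 1. Taking the log-odds of that probability recovers the linear score, which is why the model is linear in the log-odds.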


Decision Tree Classifier (DTC) 

A decision tree classifier is a hierarchical tree structure
in which decision nodes represent attributes and edges
represent attribute values. This tree-like representation
makes it possible to derive decision rules for classifying
new data instances.


A decision tree is a tool for making decisions that uses a
tree-like model of decisions and their possible outcomes,
such as chance event outcomes, resource costs, and utility.
It is one way of displaying an algorithm that is made up
entirely of conditional control statements.
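
As a small illustration of this structure, the sketch below encodes a hand-built tree as nested dictionaries and classifies a review by following edges from the root to a leaf. The attributes, the tree shape, and the helper names are invented for illustration. In practice the tree is learned from the training data.

```python
# Hand-built toy tree: decision nodes test an attribute, edges are its values.
# The attributes and the tree shape are invented for illustration only.
TREE = {
    "attribute": "contains_good",
    "branches": {
        True: {
            "attribute": "contains_not",
            "branches": {True: "Negative", False: "Positive"},
        },
        False: {
            "attribute": "contains_bad",
            "branches": {True: "Negative", False: "Neutral"},
        },
    },
}


def classify(node, instance):
    """Follow the edge matching each attribute value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][instance[node["attribute"]]]
    return node


def extract(text):
    """Turn a review into the boolean attributes used by the toy tree."""
    words = set(text.lower().split())
    return {f"contains_{w}": w in words for w in ("good", "not", "bad")}


print(classify(TREE, extract("the battery is not good")))
```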


Result and Discussion 

The predictive accuracy of each model is calculated after
training and testing on the dataset to decide which model is
the best classifier for classifying feedback. As seen in the
table, the SVM model has the best predictive accuracy of the
four models, whereas the Decision Tree model has the worst.


Classifier                     | Accuracy
Logistic Regression Classifier | 93.92%
Support Vector Machine         | 93.94%
Random Forest Classifier       | 93.50%
Decision Tree Classifier       | 90.10%





> After trying a few arbitrary reviews, it seems that our
features are working properly, producing Positive, Neutral,
and Negative outcomes.

> We can also see that our Support Vector Machine
Classifier has improved to 94.08 percent accuracy after
running the grid search.
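
The grid search setup itself is not shown in the paper. A minimal sketch of how such a search could look with scikit-learn's `GridSearchCV` over a TF-IDF plus `LinearSVC` pipeline is given here. The tiny corpus, the parameter grid, and all variable names are assumptions for illustration, not the authors' configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny made-up corpus standing in for the review text and sentiment labels.
train_text = ["great phone", "love this tablet", "works great", "really good",
              "terrible battery", "stopped working", "very bad screen", "awful sound"]
train_sentiment = ["Positive"] * 4 + ["Negative"] * 4

pipe = Pipeline([("vect", TfidfVectorizer()), ("clf", LinearSVC())])

# Candidate settings to try; every combination is cross-validated.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(train_text, train_sentiment)

predicted = search.predict(["great sound", "bad battery"])
print(search.best_params_, list(predicted))
```

After fitting, `search.predict` uses the best parameter combination refitted on all of the training data, and predictions of this kind are what get compared against the true test labels.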


from sklearn.metrics import accuracy_score, classification_report

# Compare the true test labels with the predictions of the grid-searched LinearSVC pipeline.
print(classification_report(X_test_targetSentiment, predictedGS_clf_LinearSVC_pipe))
print('Accuracy: {}'.format(accuracy_score(X_test_targetSentiment, predictedGS_clf_LinearSVC_pipe)))


              precision    recall  f1-score   support

                   0.00      0.00      0.00         5
    Negative       0.67      0.25      0.36       156
     Neutral       0.47      0.11      0.18       292
    Positive       0.95      1.00      0.97      6473

    accuracy                           0.94      6926
   macro avg       0.52      0.34      0.38      6926
weighted avg       0.92      0.94      0.92      6926


Accuracy: 0.9408027721628646
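
The gap between the macro and weighted averages in the report comes from class imbalance: the Positive class accounts for 6473 of the 6926 test reviews. The sketch below recomputes both averages from the per-class rows of the report, taking each score as a value between 0 and 1. The row with the blank label is kept under the placeholder name "(blank)", which is our own label.

```python
# Per-class rows from the report: (precision, recall, f1, support).
rows = {
    "(blank)":  (0.00, 0.00, 0.00, 5),
    "Negative": (0.67, 0.25, 0.36, 156),
    "Neutral":  (0.47, 0.11, 0.18, 292),
    "Positive": (0.95, 1.00, 0.97, 6473),
}
total = sum(support for *_, support in rows.values())


def macro_avg(index):
    """Unweighted mean: every class counts equally."""
    return sum(row[index] for row in rows.values()) / len(rows)


def weighted_avg(index):
    """Mean weighted by support: large classes dominate."""
    return sum(row[index] * row[3] for row in rows.values()) / total


for name, index in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    print(name, round(macro_avg(index), 2), round(weighted_avg(index), 2))
```

The Negative and Neutral classes have low recall (0.25 and 0.11), which pulls the macro average down even though the overall accuracy is about 94 percent.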


Conclusion and Future Work 

Sentiment analysis is the process of recognizing and
aggregating user sentiment or opinions, that is, of
deciding whether the polarity of the text in a document or
sentence is positive, negative, or neutral. In this work,
four approaches were compared and their accuracy was
calculated on the product review dataset. The accuracy of
Logistic Regression is found to be 93.92 %, SVM 93.94 %,
Decision Tree 90.10 %, and Random Forest 93.50 %. Among the
four models, the SVM model has the highest predictive
accuracy. We also observed that very large text files take a
long time to process. Automatic sentiment analysis is a
powerful tool for detecting current patterns and forecasting
future ones. While opinions at the feature level have been
examined, there are still many limitations that can be
explored further. Potential directions for future
development are:


> Handling product reviews written in a variety of languages.
> Addressing the issue of slang mapping.
> Dealing with sarcastically expressed views.
> Identifying comparative views and determining which of
the two products under consideration is better.
> Dealing with anaphora resolution, that is, working out
what an opinion actually refers to.


In the future, the work could be expanded to conduct
multiclass classification of reviews, which would give
consumers a clearer picture of a review's essence and allow
them to make better product decisions. It could also be used
to predict a product's rating based on the review text. This
would provide consumers with a more trustworthy rating,
because the star rating of a product and the sentiment of
the review often contradict each other. The proposed
extension would be highly beneficial to the e-commerce
industry by increasing customer loyalty and confidence.